A simple interpreter from scratch in Python (part 2)
In the previous article in the series, we covered the IMP language and the general structure of the interpreter we are building for it. We also covered the lexer in depth. In this article, we will write a small parser combinator library. This library will be used to create the IMP parser, which extracts an abstract syntax tree (AST) from the list of tokens generated by the lexer. The combinator library is language agnostic, so you could use it to write a parser for any language.
What are parser combinators?
There are many, many ways to build a parser. Combinators are probably the easiest and fastest way to get a parser up and running.
You can think of a parser as a function. It accepts a stream of tokens as input. If successful, the parser will consume some tokens from the stream. It will return part of the final AST, along with the remaining tokens. A combinator is a function which produces a parser as its output, usually after taking one or more parsers as input, hence the name "combinator". You can use combinators to create a complete parser for a language like IMP by creating lots of smaller parsers for parts of the language, then using combinators to build the final parser.
A minimal combinator library
Parser combinators are fairly generic, and can be used with any language. As we did with the lexer, we'll start by writing a language agnostic library of combinators, then use that to write our IMP parser.
First, let's define a Result class. Every parser will return a Result object on success or None on failure. A Result comprises a value (part of the AST) and a position (the index of the next token in the stream).
class Result:
    def __init__(self, value, pos):
        self.value = value
        self.pos = pos

    def __repr__(self):
        return 'Result(%s, %d)' % (self.value, self.pos)
Next, we'll define a base class, Parser. Earlier, I said parsers are functions which take a stream of tokens as input. We will actually define parsers as objects with a __call__ method. This means that a parser object will behave as if it were a function, but we can also provide additional functionality by defining some operators.
class Parser:
    def __call__(self, tokens, pos):
        return None  # subclasses will override this

    def __add__(self, other):
        return Concat(self, other)

    def __mul__(self, other):
        return Exp(self, other)

    def __or__(self, other):
        return Alternate(self, other)

    def __xor__(self, function):
        return Process(self, function)
The method that actually does the parsing is __call__. The input is the full list of tokens (returned by the lexer) and an index into the list indicating the next token. The default implementation always returns None (failure). Subclasses of Parser will provide their own __call__ implementation.
The other methods, __add__, __mul__, __or__, and __xor__, define the +, *, |, and ^ operators, respectively. Each operator provides a shortcut for calling a different combinator. We'll cover each of these combinators shortly.
Next, we'll look at the simplest combinator, Reserved. Reserved will be used to parse reserved words and operators; it will accept tokens with a specific value and tag. Remember, tokens are just value-tag pairs: token[0] is the value and token[1] is the tag.
class Reserved(Parser):
    def __init__(self, value, tag):
        self.value = value
        self.tag = tag

    def __call__(self, tokens, pos):
        if pos < len(tokens) and \
           tokens[pos][0] == self.value and \
           tokens[pos][1] is self.tag:
            return Result(tokens[pos][0], pos + 1)
        else:
            return None
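To get a feel for how this behaves, here's a minimal sketch, assuming the RESERVED and INT tag constants from the lexer in part 1 and a hand-written token list:

# Sketch: Reserved matches only a token with the exact value and tag we asked for.
tokens = [(':=', RESERVED), ('1', INT)]
parser = Reserved(':=', RESERVED)
print(parser(tokens, 0))  # Result(:=, 1) -- matched, next position is 1
print(parser(tokens, 1))  # None -- the token at position 1 has a different value and tag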
At this point, you might stop and say, "I thought combinators were going to be functions returning parsers. This subclass doesn't look like a function." It is like a function though, if you think of a constructor as a function which returns an object (which in this case also happens to be callable). Subclassing is an easy way to define a new combinator, since all we need to do is provide a constructor and a __call__ method, and we still retain other functionality (like the overloaded operators).
Moving on, the Tag combinator is very similar to Reserved. It matches any token which has a particular tag. The value can be anything.
class Tag(Parser):
    def __init__(self, tag):
        self.tag = tag

    def __call__(self, tokens, pos):
        if pos < len(tokens) and tokens[pos][1] is self.tag:
            return Result(tokens[pos][0], pos + 1)
        else:
            return None
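Again as a quick sketch, assuming the INT and RESERVED tag constants from part 1:

# Sketch: Tag matches any token with the given tag, regardless of its value.
tokens = [('42', INT), ('+', RESERVED)]
print(Tag(INT)(tokens, 0))  # Result(42, 1) -- any integer token matches
print(Tag(INT)(tokens, 1))  # None -- '+' is tagged RESERVED, not INT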
The Tag and Reserved combinators are our primitives. All other combinators will be built out of them at the most basic level.
The Concat combinator will take two parsers as input (left and right). When applied, a Concat parser will apply the left parser, followed by the right parser. If both are successful, the result value will be a pair containing the left and right results. If either parser is unsuccessful, None is returned.
class Concat(Parser):
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __call__(self, tokens, pos):
        left_result = self.left(tokens, pos)
        if left_result:
            right_result = self.right(tokens, left_result.pos)
            if right_result:
                combined_value = (left_result.value, right_result.value)
                return Result(combined_value, right_result.pos)
        return None
Concat is useful for parsing specific sequences of tokens. For instance, to parse 1 + 2, you could write:
parser = Concat(Concat(Tag(INT), Reserved('+', RESERVED)), Tag(INT))
or more concisely, using the + operator shorthand:
parser = Tag(INT) + Reserved('+', RESERVED) + Tag(INT)
The Alternate combinator is similar. It also takes left and right parsers. It starts by applying the left parser. If successful, that result is returned. If unsuccessful, it applies the right parser and returns its result.
class Alternate(Parser):
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __call__(self, tokens, pos):
        left_result = self.left(tokens, pos)
        if left_result:
            return left_result
        else:
            right_result = self.right(tokens, pos)
            return right_result
Alternate is useful for choosing among several possible parsers. For instance, if we wanted to parse any binary operator:
parser = Reserved('+', RESERVED) | Reserved('-', RESERVED) | Reserved('*', RESERVED) | Reserved('/', RESERVED)
Opt is useful for optional text, such as the else clause of an if statement. It takes one parser as input. If that parser is successful when applied, the result is returned normally. If it fails, a successful result is still returned, but the value of that result is None. No tokens are consumed in the failure case; the result position is the same as the input position.
class Opt(Parser):
    def __init__(self, parser):
        self.parser = parser

    def __call__(self, tokens, pos):
        result = self.parser(tokens, pos)
        if result:
            return result
        else:
            return Result(None, pos)
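A small sketch of both outcomes, again assuming the tag constants from part 1:

# Sketch: Opt succeeds either way, consuming a token only when its parser matches.
tokens = [('-', RESERVED), ('5', INT)]
parser = Opt(Reserved('-', RESERVED))
print(parser(tokens, 0))  # Result(-, 1) -- the optional minus sign was present
print(parser(tokens, 1))  # Result(None, 1) -- no minus sign; succeed without consuming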
Rep applies its input parser repeatedly until it fails. This is useful for generating lists of things. Note that Rep will successfully match an empty list and consume no tokens if its parser fails the first time it's applied.
class Rep(Parser):
    def __init__(self, parser):
        self.parser = parser

    def __call__(self, tokens, pos):
        results = []
        result = self.parser(tokens, pos)
        while result:
            results.append(result.value)
            pos = result.pos
            result = self.parser(tokens, pos)
        return Result(results, pos)
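Here's a quick sketch of Rep collecting matches into a list (tags as in part 1):

# Sketch: Rep gathers zero or more matches; it never fails outright.
tokens = [('1', INT), ('2', INT), ('+', RESERVED)]
parser = Rep(Tag(INT))
print(parser(tokens, 0))  # Result(['1', '2'], 2) -- stops at the first non-INT token
print(parser(tokens, 2))  # Result([], 2) -- an empty list, no tokens consumed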
Process is a useful combinator which allows us to manipulate result values. Its input is a parser and a function. When the parser is applied successfully, the result value is passed to the function, and the return value of the function is returned instead of the original value. We will use Process to actually build the AST nodes out of the pairs and lists that Concat and Rep return.
class Process(Parser):
    def __init__(self, parser, function):
        self.parser = parser
        self.function = function

    def __call__(self, tokens, pos):
        result = self.parser(tokens, pos)
        if result:
            result.value = self.function(result.value)
            return result
As an example, consider the parser we built with Concat. When it parses 1 + 2, the result value we actually get back is (('1', '+'), '2'), which is not very useful. With Process, we can change that result. For example, the following parser would sum the parsed expression.
def process_func(parsed):
    ((l, _), r) = parsed
    return int(l) + int(r)

better_parser = parser ^ process_func
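Running both parsers on the tokens for 1 + 2 shows the difference (this is just a sketch, with the token list written out by hand using the tags from part 1):

tokens = [('1', INT), ('+', RESERVED), ('2', INT)]
print(parser(tokens, 0))         # Result((('1', '+'), '2'), 3) -- raw nested pairs
print(better_parser(tokens, 0))  # Result(3, 3) -- process_func summed the operands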
Lazy is a less obviously useful combinator. Instead of taking an input parser, it takes a zero-argument function which returns a parser. Lazy will not call the function to get the parser until it's applied. This is needed to build recursive parsers (like how arithmetic expressions can include arithmetic expressions). Since such a parser refers to itself, we can't just define it by referencing it directly; at the time the parser's defining expression is evaluated, the parser is not defined yet. We would not need this in a language with lazy evaluation, like Haskell or Scala, but Python is anything but lazy.
class Lazy(Parser):
    def __init__(self, parser_func):
        self.parser = None
        self.parser_func = parser_func

    def __call__(self, tokens, pos):
        if not self.parser:
            self.parser = self.parser_func()
        return self.parser(tokens, pos)
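As a hedged sketch of why this matters, here is a toy recursive grammar (not part of IMP): an integer wrapped in any number of parentheses, assuming '(' and ')' are tokenized as RESERVED. Without Lazy, constructing the parser would recurse forever, because term() calls group(), which would immediately call term() again.

def term():
    return Tag(INT) | group()

def group():
    # Lazy defers the call to term() until the parser is actually applied.
    return Reserved('(', RESERVED) + Lazy(term) + Reserved(')', RESERVED)

tokens = [('(', RESERVED), ('42', INT), (')', RESERVED)]
print(term()(tokens, 0))  # Result((('(', '42'), ')'), 3)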
The next combinator, Phrase, takes a single input parser, applies it, and returns its result normally. The only catch is that it will fail if its input parser did not consume all of the remaining tokens. The top-level parser for IMP will be a Phrase parser. This prevents us from partially matching a program which has garbage at the end.
class Phrase(Parser):
    def __init__(self, parser):
        self.parser = parser

    def __call__(self, tokens, pos):
        result = self.parser(tokens, pos)
        if result and result.pos == len(tokens):
            return result
        else:
            return None
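A sketch using the better_parser from above, assuming the ID tag from part 1 for the trailing junk token:

# Sketch: Phrase rejects input when tokens are left over after the match.
tokens = [('1', INT), ('+', RESERVED), ('2', INT), ('garbage', ID)]
print(Phrase(better_parser)(tokens, 0))      # None -- 'garbage' was never consumed
print(Phrase(better_parser)(tokens[:3], 0))  # Result(3, 3) -- all tokens consumed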
The last combinator is, unfortunately, the most complicated. Exp is fairly specialized; it's used to match an expression which consists of a list of elements separated by something. Here's an example with compound statements:
a := 10; b := 20; c := 30
In this case, we have a list of statements separated by semicolons. You might think that we don't need Exp, since we can do the same thing with the other combinators. We could just write a parser for the compound statement like this:
def compound_stmt():
    return stmt() + Reserved(';', RESERVED) + stmt()
Think about how stmt might be defined, though:
def stmt():
    return Lazy(compound_stmt) | assign_stmt()
So stmt starts by calling compound_stmt, which starts by calling stmt. These parsers will call each other over and over without getting anything done until we get a stack overflow. This problem is not limited to compound statements; arithmetic and Boolean expressions have the same problem (think of operators like + or and as the separators). This problem is called left recursion, and parser combinators are bad at it.
Fortunately, Exp provides a way to work around left recursion, just by matching a list, similar to the way Rep does. Exp takes two parsers as input. The first parser matches the actual elements of the list. The second matches the separators. On success, the separator parser must return a function which combines elements parsed on the left and right into a single value. This value is accumulated for the whole list, left to right, and is ultimately returned.
Let's look at the code:
class Exp(Parser):
    def __init__(self, parser, separator):
        self.parser = parser
        self.separator = separator

    def __call__(self, tokens, pos):
        result = self.parser(tokens, pos)

        def process_next(parsed):
            (sepfunc, right) = parsed
            return sepfunc(result.value, right)
        next_parser = self.separator + self.parser ^ process_next

        next_result = result
        while next_result:
            next_result = next_parser(tokens, result.pos)
            if next_result:
                result = next_result
        return result
result will always contain everything that's been parsed so far. process_next is a function that can be used with Process. next_parser will apply separator followed by parser to get the next element of the list. process_next will create a new result by applying the separator function to the current result and the newly parsed element. next_parser is applied in a loop until it can't match any more elements.
Let's see how Exp can be used to solve our compound_stmt problem.
def assign_stmt():
    ...

def compound_sep():
    def process_sep(parsed):
        return lambda l, r: CompoundStmt(l, r)
    return Reserved(';', RESERVED) ^ process_sep

def compound_stmt():
    return Exp(assign_stmt(), compound_sep())
We could also write this as:
def compound_stmt():
    return assign_stmt() * compound_sep()
We'll go over this in much more detail when we cover parsing of arithmetic expressions in the next article.
To be continued...
In this article, we built a minimal parser combinator library. This library can be used to write a parser for pretty much any computer language.
In the next article, we'll define the data structures that make up the abstract syntax tree for IMP, and we'll define a parser using this library.
If you're curious to try the IMP interpreter right now, or if you want to see the full source code, you can download it now.